ABSTRACT


We explored how learners’ subjective ratings of open
educational resources (OERs), in terms of how “helpful” they
find them, can predict the actual learning gains associated
with those resources as measured with pre- and post-tests.
To this end, we developed a probabilistic model called
GRAM (Gaussian Rating Aggregation Model) that com- 
bines subjective ratings from multiple learners into an ag- 
gregate quality score of each resource. Based on an exper- 
iment we conducted on Mechanical Turk (n = 304 par- 
ticipants with m = 17 math tutorial videos as resources), 
we found that aggregated subjective ratings are highly (and
statistically significantly) predictive of the resources’
average learning gains, with a Pearson correlation of 0.78.
Moreover, when predicting average learning gains of new
learners, subjective scores
were still predictive (Pearson correlation of 0.49) and at- 
tained higher prediction accuracy than a model that di- 
rectly uses pre- and post-test data to estimate learning gains 
for each resource. These results have potential implica- 
tions for large-scale learning platforms (e.g., MOOCs, Khan 
Academy) that assign resources (tutorials, explanations, hints, 
etc.) to learners based on the expected learning gains. 
1. INTRODUCTION


Consider a hypothetical large-scale online learning platform 
in which learners engage with open educational resources 
(OERs) that are sampled from a vast collection. These re- 
sources could include tutorial videos, practice exercises, ex- 
planations of wrong answers, hints, etc. In order to help 
students learn optimally, the learning platform must decide
which resource is most beneficial to each learner at each
moment in time, and then assign that resource to the learner
(see Figure 1). Although various criteria could be used for
this decision (e.g., the impact on student engagement),
perhaps the most natural one is how much the student will
learn (the learning gains) from receiving the resource.

Figure 1: An adaptive learning community in which each
learner $i$ is assigned different resources over time, and the
effectiveness (expected learning gains $l_{ij}$) of each
resource $j$ is estimated both from test scores and from
subjective ratings $s_{ij}$ given by the learners. Gray lines
show hypothetical assignments of OERs to learners.
$E_i[l_{ij}]$ denotes the average learning gains over all
learners $i$ who received $j$.


The standard way to estimate the learning gains $l_{ij}$ of
each resource is to give each student $i$ who receives
resource $j$ a pre-test (before receiving it) and a post-test
(after receiving it) to measure how much she/he learned,
i.e., the difference between the post- and pre-test scores.
We call each (learner, resource) pair an assignment. After a
sufficient number of assignments, the average learning gains
of each resource, $E_i[l_{ij}]$ (averaged over all learners
$i$ who receive $j$), can be estimated.
Then, using these estimates for all the resources, the most 
effective ones can be served to students. Unfortunately, this 
approach to estimating the quality of a large collection of
OERs is expensive because testing takes a long time. On
the other hand, after receiving a resource $j$, learners may
have a subjective opinion about how effective $j$ was. These
opinions can arguably be queried more easily and efficiently
than administering tests; for example, the learner could
simply select between 1 and 5 stars (à la Yelp) to express how
much she/he liked it. It is even possible that subjective 
scores might be better than test scores in some situations. 
For example, even if a learner her/himself has already mas- 
tered a skill and thus has a learning gain of 0, she/he might 
still be able to judge whether a resource is useful. 


When using subjective scores to predict learning gains, care 
must be taken: some learners may be more or less reliable 
in making such judgments. However, there are reasons to 
be optimistic: (1) As long as enough learners “vote”, then 
the noise of their judgments can be averaged out. (2) Using 
algorithms for crowdsourcing consensus (see below), the re- 
liabilities of the learners as well as the learning gains of the 
resources can be estimated in an unsupervised fashion. The 
chief contribution of our work is to propose and evaluate 
experimentally an efficient crowdsourcing model to estimate 
the quality of a set of learning resources by combining mul- 
tiple learners’ subjective opinions about them. 


2. RELATED WORK 


Students’ judgments of learning and teaching: Esti- 
mating the learning gains of an OER is related to metacogni- 
tion. The ability of students to judge how well other people 
learn has been analyzed experimentally in prior works such 
as [12, 3]. However, we are not aware of previous research 
that considers this problem in the large scale of an online 
learning community or how to combine multiple learners’ 
judgments to improve accuracy. In the context of student 
course evaluations, there is evidence that learners may ac- 
tually be poor judges of their teachers’ effectiveness [7, 4]. 


Adaptive online learning communities: Adaptive learn- 
ing communities that decide which resources to serve to stu- 
dents based on up-to-date estimates have generated recent 
interest in the educational data mining and reinforcement 
learning communities. Notable works are by Rafferty, et 
al. [9] and Williams, et al. [17]. In these works, reinforcement 
learning techniques based on bandits and Thompson sam- 
pling were used both to estimate the learning gains of each 
resource and simultaneously to assign resources to learners. 
Our work is complementary: we explore how not only test
score information, but also subjective ratings provided by
learners, could be useful in estimating the utility of each
resource.


Crowdsourcing for education: In [14], Weld et al. provided
an overview of how online learning creates challenges
due to its large scale, and also suggested possible ways in
which crowdsourcing can offer solutions to these challenges.
Heffernan, et al. [6] proposed a vision of how crowdsourcing
can help provide important functionality toward adaptive
personalized online learning. As one specific instance
of how the crowd can contribute new resources to an online 
learning community, Williams, et al. showed that people 
on Mechanical Turk can be induced to author novel and 
useful text-based explanations [17]. Whitehill & Seltzer [16]
showed that Mechanical Turk workers can even create entire
tutorial videos, at least some of which are effective at
helping students to learn. Peer grading (e.g., Piech, et al. [8])
and peer feedback are other ways of harnessing the crowd to
provide useful feedback for learners at scale.

Figure 2: Gaussian Rating Aggregation Model (GRAM).
Only the subjective ratings $s_{ij}$ from student $i$ about
resource $j$ are observed. Latent variable $q_j$ expresses the
“quality” of resource $j$ and is used to predict the learning
gains of students who receive the resource.


Crowdsourcing consensus algorithms: Since Dawid and 
Skene’s seminal work [5] on optimal weighting of annotators’ 
opinions, there have been a slew (e.g., [15, 10, 11, 2, 13, 1]) of 
crowdsourcing models, which are suitable for different kinds 
of tasks (binary, multiple choice, etc.) and capture different 
features of the labeling task (e.g., task difficulty, biases). 


3. GAUSSIAN RATING AGGREGATION MODEL (GRAM)


We model the quality of each open educational resource
(OER) $j$ with a real number, $q_j$, that can be estimated by
aggregating over many (real-valued) subjective ratings
$s_{ij}$ from many learners $i$. We thus develop a Gaussian
probability model of how each $s_{ij}$ is related to each
$q_j$ as well as several parameters specific to each learner
$i$. The model is portrayed in Figure 2: Let $\mu_i$ and
$\gamma_i^2$ be the bias and reliability (variance) of learner
$i$, respectively. Let $q_j$ be the ground-truth quality of
resource $j$. We posit that student $i$’s label $s_{ij}$ for
resource $j$ is a Gaussian random variable with mean
$q_j + \mu_i$ and variance $\gamma_i^2$. In other words, if
the ground-truth quality is $q_j$, then student $i$ adds a
bias $\mu_i$, and then adds independent 0-mean Gaussian noise
with variance $\gamma_i^2$. We can express these relationships
using the conditional probability density function (PDF)
$P(s_{ij} \mid q_j, \mu_i, \gamma_i^2) = \mathcal{N}(q_j + \mu_i, \gamma_i^2)$,
where $\mathcal{N}$ is a Gaussian with a given mean and variance.
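
To make the generative process above concrete, the following minimal sketch samples ratings from a Gaussian with mean equal to quality plus bias and with a learner-specific variance. The function name `sample_rating` and every numeric value are hypothetical and chosen only for illustration; they are not taken from the experiment.

```python
import random

def sample_rating(q_j, mu_i, gamma2_i, rng):
    """Draw one rating s_ij from N(q_j + mu_i, gamma2_i)."""
    # random.gauss expects a standard deviation, so take the square root
    # of the variance gamma2_i.
    return rng.gauss(q_j + mu_i, gamma2_i ** 0.5)

rng = random.Random(0)
# Hypothetical values: a resource of quality 3.5 rated by a learner who
# tends to rate 0.5 too high, with rating variance 0.25.
ratings = [sample_rating(3.5, 0.5, 0.25, rng) for _ in range(20000)]
mean_rating = sum(ratings) / len(ratings)
var_rating = sum((r - mean_rating) ** 2 for r in ratings) / len(ratings)
```

With many samples, `mean_rating` approaches the quality plus the bias and `var_rating` approaches the learner's variance, which is why both learner parameters have to be accounted for when recovering the quality from ratings.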


3.1 Inference 

As with many crowdsourcing consensus models, inference in
the GRAM requires solving a “chicken-and-egg” problem: if
the parameters $\mu_i, \gamma_i^2$ of each learner $i$ were
known, then an optimal weighting of their votes $s_{ij}$
could be used to estimate the quality $q_j$ of each resource
$j$. On the other hand, if the ground-truth quality $q_j$ of
each resource were known, then the parameters of each learner
could be estimated. We solve this problem using
Expectation-Maximization: in the E-Step we compute the PDF of
each $q_j$ conditional on the parameters
$\{\mu_i, \gamma_i^2\}$. In the M-Step, we compute the
expected joint log-likelihood of the $\{q_j\}$ and
$\{s_{ij}\}$ w.r.t. the PDFs computed during the previous
E-Step, and then maximize this expectation w.r.t. the
parameters $\{\mu_i, \gamma_i^2\}$. Since the GRAM is
Gaussian, both the E- and M-Steps can be done analytically,
and thus the algorithm is very efficient.


463 Proceedings of The 12th International Conference on Educational Data Mining (EDM 2019) 


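Let the mean and variance of the prior distribution over each
$q_j$ be $\tilde{m}$ and $\tilde{s}^2$, respectively. Recall
that the product of two Gaussian PDFs, with means $m_1$ and
$m_2$ and variances $s_1^2$ and $s_2^2$, respectively, is also
Gaussian (up to normalization) and has mean
$s^2 (m_1/s_1^2 + m_2/s_2^2)$ and variance
$s^2 = (1/s_1^2 + 1/s_2^2)^{-1}$.


E-Step: Applying this rule to the prior and to the likelihood
terms of all learners $i$ who rated resource $j$, the posterior
$P(q_j)$ is Gaussian with variance $s_j^2$ and mean $m_j$:

$$s_j^2 = \left(\frac{1}{\tilde{s}^2} + \sum_i \frac{1}{\gamma_i^2}\right)^{-1}, \qquad
m_j = s_j^2 \left(\frac{\tilde{m}}{\tilde{s}^2} + \sum_i \frac{s_{ij} - \mu_i}{\gamma_i^2}\right),$$

where both sums run over the learners $i$ who rated resource $j$.

In other words, the posterior distribution of each $q_j$ is a
Gaussian whose mean is the weighted average of the relevant
$s_{ij}$ after shifting each one by the learner’s bias $\mu_i$
and then weighting it by the inverse variance $1/\gamma_i^2$.
We can achieve a non-informative prior by setting the variance
$\tilde{s}^2$ to be very high (e.g., 1000).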

M-Step: We derive the auxiliary function $Q$ as the
expectation, w.r.t. the PDF $P$ computed during the E-Step, of
the joint log-likelihood of the observed ratings $\{s_{ij}\}$
and hidden ratings $\{q_j\}$. In the derivation below, $C$ and
$D$ are constants that do not depend on any of the parameters.

where we omitted the constant $D$ in the last line for brevity.
The two integrals are the first and second plain moments of
$P(q_j)$. The first is the mean of $P(q_j)$, i.e., $m_j$. The
second can be obtained using the fact that the variance
$V[x] = E[x^2] - E[x]^2$ for any random variable $x$. The
second plain moment is thus $m_j^2 + s_j^2$, where $s_j^2$ is
the variance of $P(q_j)$. Hence,

We now differentiate with respect to each parameter, set to
0, and solve:
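
Because the model is Gaussian, these steps can be carried out in closed form. The block below is a sketch derived from the model definition in Section 3, not a transcription of the original typeset equations. The symbols $J_i$ (the set of resources rated by learner $i$), $m_j$ and $s_j^2$ (the mean and variance of the E-Step posterior $P(q_j)$) are notation assumed here, and "const" collects all terms that do not depend on $\mu_i$ or $\gamma_i^2$.

```latex
\begin{align*}
E\big[(s_{ij} - \mu_i - q_j)^2\big]
  &= (s_{ij} - \mu_i - m_j)^2 + s_j^2, \\
Q &= -\frac{1}{2} \sum_i \sum_{j \in J_i}
     \left[ \log \gamma_i^2
       + \frac{(s_{ij} - \mu_i - m_j)^2 + s_j^2}{\gamma_i^2} \right]
     + \text{const}, \\
\frac{\partial Q}{\partial \mu_i} = 0
  &\;\Rightarrow\;
  \mu_i = \frac{1}{|J_i|} \sum_{j \in J_i} \left( s_{ij} - m_j \right), \\
\frac{\partial Q}{\partial \gamma_i^2} = 0
  &\;\Rightarrow\;
  \gamma_i^2 = \frac{1}{|J_i|} \sum_{j \in J_i}
     \left[ (s_{ij} - \mu_i - m_j)^2 + s_j^2 \right].
\end{align*}
```

The first line uses the first and second moments of $P(q_j)$; the update for $\gamma_i^2$ uses the freshly updated $\mu_i$, since the optimal $\mu_i$ does not depend on $\gamma_i^2$.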
For our experiments we conducted 50 EM iterations.
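
The EM procedure described in this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `gram_em`, the dictionary data layout, the initial values (zero bias, unit variance), the small variance floor, and the toy data are all hypothetical, and the update formulas are derived from the Gaussian model rather than copied from the paper.

```python
def gram_em(ratings, prior_mean=0.0, prior_var=1000.0, n_iter=50,
            learn_bias=True, learn_var=True):
    """ratings: dict mapping (learner, resource) -> rating s_ij.

    Returns (m, s2, mu, gamma2): posterior mean and variance of each
    resource quality, and the bias and variance of each learner.
    """
    learners = sorted({i for i, _ in ratings})
    resources = sorted({j for _, j in ratings})
    mu = {i: 0.0 for i in learners}      # assumed initialization
    gamma2 = {i: 1.0 for i in learners}  # assumed initialization
    m, s2 = {}, {}
    for _ in range(n_iter):
        # E-step: Gaussian posterior of each q_j, obtained by multiplying
        # the prior with one Gaussian term per rating of resource j.
        precision = {j: 1.0 / prior_var for j in resources}
        weighted = {j: prior_mean / prior_var for j in resources}
        for (i, j), s in ratings.items():
            precision[j] += 1.0 / gamma2[i]
            weighted[j] += (s - mu[i]) / gamma2[i]
        for j in resources:
            s2[j] = 1.0 / precision[j]
            m[j] = s2[j] * weighted[j]
        # M-step: closed-form updates of the learner parameters.
        count = {i: 0 for i in learners}
        resid = {i: 0.0 for i in learners}
        for (i, j), s in ratings.items():
            count[i] += 1
            resid[i] += s - m[j]
        if learn_bias:
            mu = {i: resid[i] / count[i] for i in learners}
        if learn_var:
            sq = {i: 0.0 for i in learners}
            for (i, j), s in ratings.items():
                sq[i] += (s - mu[i] - m[j]) ** 2 + s2[j]
            # The floor (an assumption) avoids a variance collapsing to 0.
            gamma2 = {i: max(sq[i] / count[i], 1e-6) for i in learners}
    return m, s2, mu, gamma2

# Hypothetical toy data: three learners with different biases rate three
# resources without noise.
true_q = {"a": 1.0, "b": 2.0, "c": 3.0}
true_bias = {"u1": 0.0, "u2": 1.0, "u3": -1.0}
toy = {(i, j): true_q[j] + true_bias[i] for i in true_bias for j in true_q}
m, s2, mu, gamma2 = gram_em(toy, learn_var=False)
```

Note that when biases are learned, a constant can be moved between all qualities and all biases without changing the likelihood, so only differences between resources (and between learners) are pinned down by the data; the weak prior resolves the remaining offset. The flags `learn_bias` and `learn_var` make it easy to fix one group of parameters.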


3.2 Regularizing the model 

In the full-fledged GRAM, all of the parameters (bias and re- 
liability of each rater) are learned in an unsupervised fashion 
(see Section 3.1). Given enough data, these parameters can
lead to more accurate estimates of each $q_j$. However, given
limited data, it can also be useful to regularize the model by
removing parameters and/or fixing them to known values.
In fact, if there are too few subjective scores $s_{ij}$ per
learner, then it is important to remove some parameters
because otherwise the model encounters identifiability
problems. Hence, we considered several variants of the GRAM:
(1) each $\gamma_i^2$ is estimated, but each $\mu_i$ is fixed
to 0; (2) each $\mu_i$ is estimated, but $\gamma_i^2 = 1$.
Finally, we also explored the hypothesis that the students
with the higher pre-test scores might, perhaps due to a higher
overall engagement, also be more reliable in giving subjective
ratings. Hence, we also tried: (3) $\mu_i$ is estimated, but
$\gamma_i^2 = 1/\sqrt{E_j[p_{ij}] + \epsilon}$, where
$E_j[p_{ij}]$ is the average (over all their assignments)
pre-test score $p_{ij}$ of student $i$ before receiving
resource $j$, and $\epsilon = 0.1$ ensures that
the denominator is positive.
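
As a small illustration of the pre-test-based variant, the sketch below fixes a learner's rating variance from their average pre-test score. It assumes the reading in which the constant 0.1 is added to the average score before the square root is taken; the function name `variance_from_pretest` and the example scores are hypothetical.

```python
def variance_from_pretest(pretest_scores, eps=0.1):
    """Rating variance of one learner from their pre-test scores.

    Assumed form: 1 / sqrt(mean pre-test score + eps), so that learners
    with higher pre-test scores receive a smaller variance (more weight).
    """
    mean_score = sum(pretest_scores) / len(pretest_scores)
    return 1.0 / (mean_score + eps) ** 0.5

# Hypothetical learners with pre-test scores on a 0-to-1 scale.
strong_learner = variance_from_pretest([0.8, 1.0])
weak_learner = variance_from_pretest([0.0, 0.2])
```

Under this assumption the offset keeps the variance finite even for a learner whose pre-test scores are all 0.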


4. MODELS FOR COMPARISON 

We compared the GRAM to two other models: (1) the
unweighted average of subjective scores, and (2) a prediction
model trained directly on pre- and post-test scores.

4.1 Unweighted average of subjective scores
Instead of using the GRAM, we can estimate the quality $q_j$
of each resource $j$ simply as the unweighted average, over all
learners who rated $j$, of their subjective rating scores
$s_{ij}$.
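
This baseline amounts to a per-resource mean. A minimal sketch, assuming ratings are stored in a dictionary keyed by hypothetical (learner, resource) identifiers:

```python
from collections import defaultdict

def unweighted_quality(ratings):
    """ratings: dict mapping (learner, resource) -> rating.

    Returns a dict mapping each resource to the plain mean of its ratings.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (_, resource), score in ratings.items():
        totals[resource] += score
        counts[resource] += 1
    return {r: totals[r] / counts[r] for r in totals}

# Hypothetical ratings on a 1-to-5 scale.
example = {("u1", "a"): 4, ("u2", "a"): 2, ("u1", "b"): 5}
quality = unweighted_quality(example)
```

Every learner's rating counts equally here, regardless of that learner's bias or reliability.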


4.2 Average post-test minus pre-test scores 
The primary goal of our paper is to assess to what extent 
subjective scores can estimate the learning gains as mea- 
sured in a pre-test/post-test paradigm. Hence, a strong 
baseline — indeed, a likely upper bound — to which to com- 
pare our GRAM approach is using a prediction model that 
directly uses test scores (on training data) to estimate
students’ learning gains (on testing data). In particular, for
each resource $j$, we estimate $E_i[l_{ij}]$, the average
difference between post-tests and pre-tests of all students
$i$ in the training set who received resource $j$. We then use
this number to predict the average learning gains of resource
$j$ in the
test set. Obviously, this requires that the adaptive learning 
system administer pre- and post-tests to learners in order 
to assess each resource’s quality, and this can be much more 
time-consuming than simply asking the learner how much 
she/he likes it. Note that we also considered a prediction 
model that additionally uses students’ pre-test scores as a 
co-variate, which could model possible ceiling effects in the 
tests. However, our results with that model were slightly 
worse, and hence we do not report them. 
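
The test-score baseline can be sketched as follows: average the post-test minus pre-test difference per resource on the training records and reuse that number as the prediction for unseen learners. The record layout, the function name `mean_gain_per_resource`, and the scores are hypothetical.

```python
def mean_gain_per_resource(records):
    """records: iterable of (learner, resource, pre_score, post_score).

    Returns a dict mapping each resource to its mean learning gain.
    """
    sums, counts = {}, {}
    for _, resource, pre, post in records:
        sums[resource] = sums.get(resource, 0.0) + (post - pre)
        counts[resource] = counts.get(resource, 0) + 1
    return {r: sums[r] / counts[r] for r in sums}

# Hypothetical training records with test scores on a 0-to-1 scale.
train = [
    ("u1", "a", 0.25, 0.75),
    ("u2", "a", 0.5, 0.5),
    ("u3", "b", 0.5, 0.75),
]
predicted_gain = mean_gain_per_resource(train)
```

The values in `predicted_gain` would then be compared against the mean gains observed for the same resources among held-out learners.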


5. EXPERIMENT 


To assess how well subjective scores of the resources’ qual- 
ity predicted their associated learning gains, we conducted 
a randomized experiment on Mechanical Turk. Each participant
was paid $1 and could complete up to 3 tasks. In each
task, the pre- and post-tests were the same, but the learning 
resource was usually different due to random assignment. 


5.1 Overview 

During the task, participants learned about logarithms. Log- 
arithms are a topic that many adults have learned, but many 
have forgotten. The topic is hard enough to induce variabil- 
ity in test scores, but easy enough to be learned (or re- 
freshed) in a short amount of time. The learning resources 
in our experiment comprised a set of tutorial videos on log- 
arithms, most of which were 2-3 minutes long. These re- 
sources were authored by different people around the world 
and collected in a study by Whitehill & Seltzer [16]. Each 
tutorial explains the solution to one of the math problems 
that appeared on the pre-test (see Figure 4). 


To select videos for our experiment, we watched over 100 
candidate tutorial videos collected by [16]. Each video was 
watched by at least one of the investigators and labeled as 
either “High Quality,” “Low Quality,” or “Not Acceptable.” 
Videos labeled as “Not Acceptable” were excluded. To in- 
duce some variability in the quality of videos, we chose one 
“High Quality” video as well as one “Low Quality” for each of 
the Basic Logarithm problems in the pre-test (see Figure 4), 
except for a few problems where only one quality level was 
available. In total, there were m = 17 resources (tutorials) 
that could be assigned; see Figure 3 for examples. 


Figure 3: Sample learning resources (tutorial videos on
logarithms from [16]) that we used in our experiment.

Figure 4: The pre-test on logarithms (borrowed from [16])
in our experiment.


5.2 Protocol


The experiment was built as a web application using HTML
and JavaScript. Each session consisted of multiple phases:


1. Survey: The participants were first asked some basic 
demographic questions, such as their highest level of 
education, gender, and age. (Note that we did not use 
these data in the analyses in this paper.) 


2. Pre-test: The pre-test surveyed their pre-existing skills 
in three areas: Basic Logarithms, Logarithms and Vari- 
ables, and Equations with Logarithms. 


3. Tutorial video: Participants were then randomly as- 
signed one of the 17 different tutorial videos. 


4. Subjective rating of the resource: On a Likert 
scale of 1 to 5, participants were asked how much they 
agreed with the statement: “This video will help other 
students learn about logarithms.” 


5. Post-test: The post-test contained different math prob- 
lems but was otherwise comparable in format, subject 
matter, and difficulty to the pre-test. 


6. RESULTS AND ANALYSIS 

Figure 5: Box plot, for each OER (math tutorial video) $j$,
of the learning gains $l_{ij}$ for all users $i$ who received
it. Resources are sorted according to their median learning
gains over all learners who received them.

A total of n = 304 participants completed the task. Of
these, 239 completed 1 task, 35 completed 2 tasks, and 30
completed 3 tasks. Figure 5 shows the box plot, for each
resource (tutorial video) $j$, of the learning gains associated
with each resource. There is high variance in learning gains
within each resource ($E_j[V_i[l_{ij}]]$, averaged over the
m = 17 videos, is 0.03) that dwarfs the variance in average
learning gains between resources ($V_j[E_i[l_{ij}]]$ is
0.003), where $V_i[\cdot]$ denotes the variance with respect
to learners $i$ and $V_j[\cdot]$ denotes the variance with
respect to resources $j$. The relative magnitudes of these
variances make the prediction of the average learning gains
$E_i[l_{ij}]$ of each individual resource a challenging task.
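
The two variance quantities compared above can be computed as in the following sketch. It assumes population variances (`statistics.pvariance`), since the text does not say whether population or sample variances were used; the function name and the gains are hypothetical.

```python
from statistics import mean, pvariance

def within_between_variance(gains_by_resource):
    """gains_by_resource: dict mapping resource -> list of learning gains.

    Returns (within, between): the average over resources of the variance
    across learners, and the variance across resources of the mean gain.
    """
    within = mean(pvariance(g) for g in gains_by_resource.values())
    between = pvariance([mean(g) for g in gains_by_resource.values()])
    return within, between

# Hypothetical learning gains for two resources.
gains = {"a": [0.0, 0.2], "b": [0.1, 0.3]}
within, between = within_between_variance(gains)
```

When `within` is much larger than `between`, as reported above, the mean gain of a resource is hard to estimate from a small number of learners.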


6.1 Are subjective ratings correlated with average learning gains?

From the set of all subjective scores collected in our
experiment, we can aggregate the ratings $s_{ij}$, using
either the proposed GRAM or simply the unweighted average, for
each resource $j$ into a quality estimate $q_j$. Similarly, we
can compute the average learning gains associated with each
resource $j$ over all students assigned $j$ to obtain
$E_i[l_{ij}]$, where the subscript indicates that the
expectation is w.r.t. all students assigned $j$. This is
equivalent to estimating the average treatment effect of
resource $j$. We then compute the correlation (Pearson,
Spearman) between these two sets of variables. Note that,
since many learners in our experiment completed only one task,
we needed to simplify the GRAM in order to avoid
identifiability problems (see Section 3.2). Hence, instead of
the full-fledged GRAM, we used two variants: one where each
$\mu_i = 0$, and one where $\mu_i = 0$ and $\gamma_i^2$
is determined by the learner’s pre-test score.
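
Both correlation measures used in this comparison can be computed with the standard library alone. The sketch below is a generic implementation (Spearman as the Pearson correlation of average ranks); the function names and example vectors are hypothetical, and p-values are not computed.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        shared = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical quality estimates and mean gains for four resources.
quality_estimates = [1, 2, 3, 4]
mean_gains = [1, 4, 9, 16]
r_pearson = pearson(quality_estimates, mean_gains)
r_spearman = spearman(quality_estimates, mean_gains)
```

In this toy example the relationship is monotone but not linear, so the Spearman correlation is exactly 1 while the Pearson correlation is slightly lower.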


Results are shown in Table 1. Because all correlations 
are estimated within-sample (i.e., there is no separation of 
training and validation data), computing the p-values
(two-tailed) is straightforward. When the GRAM was used to
infer only the reliability $\gamma_i^2$ (first line of
Table 1), the accuracy is low: 0.15 (Pearson) and 0.13
(Spearman). On the other hand, with the other two GRAM
variants, in which a bias $\mu_i$ for each labeler is learned,
the performance was much better: up to 0.78 (Pearson) and 0.75
(Spearman) between the inferred $q_j$ and the average learning
gains. These results are clearly better than what is obtained
using just the unweighted average of the learners’ ratings.
Estimating $\gamma_i^2$ as a function of each learner’s
pre-test score did not yield a clear accuracy improvement.
Altogether, the results suggest that,
with the right aggregation model, learners’ subjective scores 
carry considerable information about the average learning 
gains of the resources they receive. 


Predicting learning gains within-sample

Method                                              | Pearson           | Spearman
GRAM (learn $\gamma_i^2$)                           | 0.15 (p = 0.56)   | 0.13 (p = 0.63)
GRAM (learn $\mu_i$)                                | 0.78 (p < 0.001)  | 0.70 (p = 0.002)
GRAM (learn $\mu_i$, set $\gamma_i^2$ from pretest) | 0.76 (p < 0.001)  | 0.75 (p < 0.001)
Unweighted average                                  | 0.38 (p = 0.14)   | 0.54 (p = 0.03)

Table 1: Accuracies, and associated p-values, of different
models when predicting the average learning gains
$E_i[l_{ij}]$ of the resources from subjective ratings reported
by learners. For aggregating learners’ subjective ratings, we
consider both the unweighted average as well as the quality
scores inferred using the GRAM.


6.2 Do subjective ratings predict the average learning gains for new students?
Suppose some new students enter the adaptive learning
community. How accurately can we predict the average learning
gains $E_i[l_{ij}]$ of a resource $j$ for these learners? How
does this
accuracy compare to that of a prediction model in which we 
estimate the effectiveness of each resource directly based on 
pre- and post-test data? 


We conducted 3-fold cross-validation, where the same stu- 
dents never appear in more than one fold. From the training 
data in each fold, we use GRAM to infer the latent variables
$q_j$ from the subjective scores $s_{ij}$; we use the variant
in which only $\mu_i$ is learned. We then compute the
correlation (Pearson, Spearman) between $q_j$ and the average
learning gains of resource $j$ over all students $i$ in the
test set who received
j. Due to the high variability in results over the 3 folds, 
we repeated the 3-fold cross-validation 30 times, and aver- 
aged the results over trials. In each trial, we ensured that 
the data were randomly partitioned such that every resource 
was assigned to at least 1 learner in at least 2 folds (i.e., one 
testing fold and one training fold). 
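
The student-level split described above can be sketched as follows: learners, not individual assignments, are shuffled and dealt into folds, so that all tasks of one learner stay together. The function names, the round-robin dealing, and the toy assignments are assumptions made for illustration; the helper only checks the coverage condition, and a caller could redraw with another seed until it holds.

```python
import random

def student_folds(assignments, k=3, seed=0):
    """Split assignments into k folds so that no learner spans two folds.

    assignments: list of tuples whose first two entries are
    (learner, resource).
    """
    learners = sorted({a[0] for a in assignments})
    rng = random.Random(seed)
    rng.shuffle(learners)
    # Deal the shuffled learners into folds in round-robin fashion.
    fold_of = {learner: idx % k for idx, learner in enumerate(learners)}
    folds = [[] for _ in range(k)]
    for a in assignments:
        folds[fold_of[a[0]]].append(a)
    return folds

def each_resource_in_two_folds(folds):
    """True if every resource was assigned in at least two folds."""
    resources = {a[1] for fold in folds for a in fold}
    return all(
        sum(any(a[1] == r for a in fold) for fold in folds) >= 2
        for r in resources
    )

# Hypothetical assignments: learner u1 completed two tasks.
assignments = [
    ("u1", "a"), ("u1", "b"), ("u2", "a"), ("u3", "c"),
    ("u4", "b"), ("u5", "c"), ("u6", "a"),
]
folds = student_folds(assignments, k=3, seed=0)
coverage_ok = each_resource_in_two_folds(folds)
```

Splitting by learner matters because a learner's bias would otherwise leak from the training ratings into the test set.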


In the cross-validation framework, computing p-values is not 
straightforward because the estimates from each fold are not 
statistically independent. Instead, we estimated the uncer- 
tainty of each correlation as the average (over the 30 trials) 
standard error (i.e., the standard deviation of the
correlations over the K = 3 folds, divided by $\sqrt{K}$). We
compare
the accuracy of predictions obtained with the GRAM to the 
predictions by the unweighted average model (Section 4.1), 
and also to the predictions from a model that has direct 
access to the test scores (see Section 4.2). The latter is a 
strong comparison because it has access to actual pre- and 
post-test scores, whereas the other models do not. 
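
The uncertainty estimate for one cross-validation trial can be sketched as below. It assumes the sample standard deviation (`statistics.stdev`), since the text does not specify sample versus population; the function name and the fold correlations are hypothetical. In the procedure described above, this quantity would then be averaged over the 30 trials.

```python
from statistics import mean, stdev

def mean_and_standard_error(fold_correlations):
    """Mean correlation over folds and its standard error.

    The standard error is the standard deviation over the K folds
    divided by the square root of K.
    """
    k = len(fold_correlations)
    return mean(fold_correlations), stdev(fold_correlations) / k ** 0.5

# Hypothetical correlations from K = 3 folds of one trial.
trial_mean, trial_se = mean_and_standard_error([0.3, 0.5, 0.7])
```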


Results are shown in Table 2. The GRAM — which utilizes 
only subjective scores, not test results, of the training data — 
is able to predict the average learning gains for new learners 
with higher accuracy (0.49 Pearson and 0.43 Spearman cor- 
relation) compared to the model that uses pre- and post-test 
data (0.36 Pearson and 0.41 Spearman correlation) to estimate
the quality of each resource. Even the unweighted average of
learners’ subjective ratings retains most of the prediction
accuracy that could be achieved using explicit pre- and
post-test score data. All in all, our results suggest that
(1) learners’ subjective ratings carry considerable
information that could be useful in an adaptive learning
community for deciding which resources are more effective than
others, and (2) using a crowdsourcing consensus model such as
our proposed GRAM can potentially yield higher accuracy than
simply taking the unweighted average.


Predicting learning gains for new students

Method | Pearson | Spearman
[table body lost in extraction; surviving standard-error fragments: 0.11, 0.11, 0.08]

Table 2: Accuracies (± their standard errors) over K = 3
cross-validation folds, of different models when predicting
the average learning gains $E_i[l_{ij}]$ of new learners
(i.e., not used for training).


7. CONCLUSION 


We investigated whether learners’ subjective opinions about 
the quality of learning resources (e.g., a tutorial video) are 
correlated with the learning gains (post-test minus pre-test) 
associated with receiving those resources. This could have 
implications for adaptive online learning communities in which 
open educational resources (OERs) are served to students
based on estimates of how effective they would be for learn- 
ing: Rather than giving relatively time-consuming pre- and 
post-tests, the adaptive learning platform could instead sim- 
ply ask learners how helpful they found the resources to be. 
We developed a novel Gaussian Rating Aggregation Model 
(GRAM) with which to aggregate many learners’ subjec- 
tive scores into an overall quality estimate for each resource. 
Based on an experiment that we conducted on Mechani- 
cal Turk, we found that (1) subjective scores are highly 
correlated with average learning gains (Pearson correlation 
of 0.78). Moreover, (2) when predicting the average learn- 
ing gains for learners who are new to the learning commu- 
nity, the accuracy (Pearson correlation of 0.49) using the 
GRAM from subjective scores was even better than esti- 
mating learning gains from test scores. 


Future work will consider how to combine subjective scores 
with test data in order to arrive at improved estimation ac- 
curacy of resources’ effectiveness. Moreover, with the goal to 
personalize education, it would be interesting to explore how 
to harness subjective ratings to estimate individual learning 
gains rather than just average learning gains. Finally, it is 
important to establish whether the results we collected in
our study on adult participants from Mechanical Turk
generalize to more authentic online learning communities (e.g.,
Khan Academy, ASSISTments). 